
Conversation

@JohanLorenzo

No description provided.

@JohanLorenzo JohanLorenzo requested a review from a team as a code owner June 20, 2024 10:31
@JohanLorenzo JohanLorenzo requested review from lotas, matt-boris and petemoore and removed request for a team June 20, 2024 10:31
@JohanLorenzo JohanLorenzo changed the title RFC#0192 - ensures workers do not get unnecessarily killed RFC#0192 - Ensure workers do not get unnecessarily killed Jun 20, 2024
@JohanLorenzo
Author

@lotas 👋 Do you see anything more we should discuss? If not, is it okay to open up the discussion in the next community meeting?

@lotas
Contributor

lotas commented Jul 22, 2024

> @lotas 👋 Do you see anything more we should discuss? If not, is it okay to open up the discussion in the next community meeting?

Please bring it up during the next community meeting, that would be best! Thanks

@JohanLorenzo
Author

I took another stab at it now that #191 is implemented 😃

Contributor

@matt-boris matt-boris left a comment

To me, this problem can be solved with minimal code changes to worker-manager: allow a secondary worker config to be used when provisioning workers to fill the minCapacity void. That way we just reuse what's already there.

This could look like allowing a second block of worker config in the worker pool config, or even just having worker-manager automatically set `idleTimeoutSecs` to 0 and use an on-demand instance type for the workers it provisions to bring capacity up to minCapacity.
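A rough sketch of what that could look like, written as a TypeScript object literal; the `minCapacityWorkerConfig` block and the surrounding field layout are hypothetical, not an existing worker-manager schema:

```typescript
// Hypothetical pool definition: a secondary config block that worker-manager
// would apply only to the instances it launches to satisfy minCapacity.
const workerPoolConfig = {
  minCapacity: 4,
  maxCapacity: 100,
  launchConfigs: [
    {
      capacityPerInstance: 1,
      // Regular overflow workers: scale to zero via the usual idle timeout.
      workerConfig: { idleTimeoutSecs: 900 },
    },
  ],
  // Applied only to workers filling the minCapacity void: never idle out,
  // and prefer on-demand instances so they are not reclaimed.
  minCapacityWorkerConfig: {
    workerConfig: { idleTimeoutSecs: 0 },
    spot: false,
  },
};
```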

@JohanLorenzo JohanLorenzo force-pushed the rfc-192 branch 2 times, most recently from 4f9302d to 119cdb2 Compare September 26, 2025 16:57
@JohanLorenzo
Author

Thank you very much @lotas and @matt-boris for your insightful input! I like where the RFC is going. It's now much simpler 👍 I believe I addressed all the comments and it's now ready for another round of review. Please let me know if there's anything I missed.


#### 2. Lifecycle-Based Worker Configuration

Workers are assigned `minCapacity` or overflow roles at launch time and never change configuration during their lifetime. This approach works within TaskCluster's architectural constraints where worker-manager cannot communicate configuration changes to running workers.
Contributor

Maybe we could keep it simple with a single boolean; we only care whether the worker is launched as a min-capacity (forever) worker or just a regular one.
`is_min_capacity` or `is_going_to_work_forever` 😃
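A minimal sketch of that single-boolean idea; the flag name and where it lives are assumptions, not anything decided in the RFC:

```typescript
// Illustrative only: a single flag set by worker-manager at launch time when
// it provisions an instance to cover minCapacity. The worker never changes role.
interface LaunchSpec {
  workerPoolId: string;
  isMinCapacity: boolean; // true => forever-worker, false => regular overflow worker
}

// The idle timeout the worker would be launched with (900s is an arbitrary example default).
function idleTimeoutSecs(spec: LaunchSpec): number {
  return spec.isMinCapacity ? 0 : 900;
}
```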

## Error Handling and Edge Cases

**Worker Lifecycle Management:**
- **Pool reconfiguration**: Capacity changes trigger worker replacement, not reconfiguration
Contributor

I think this is a very important edge case for the long-running (or forever-running) workers.
If a launch config is changed or removed (either for cost reasons, or because we want to upgrade the instance type or the image itself), what do we do?

My intuition says that if a launch config is being archived (not present in the new config) and there are long-running workers created from it, those workers should be killed.
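A sketch of the check being described, under assumed data shapes; none of these types or fields are actual worker-manager APIs:

```typescript
// Illustrative: during a provisioning iteration, select long-running workers
// whose launch config has been archived (no longer in the active pool config).
interface RunningWorker {
  workerId: string;
  launchConfigId: string;
  isMinCapacity: boolean; // true for the "forever" workers
}

function workersToTerminate(
  workers: RunningWorker[],
  activeLaunchConfigIds: Set<string>,
): RunningWorker[] {
  return workers.filter(
    (w) => w.isMinCapacity && !activeLaunchConfigIds.has(w.launchConfigId),
  );
}
```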

Author

I added a section to cover this edge case. Let me know what you think!

#### 3. Pool Configuration

**New Configuration Options:**
Pools will have a new boolean flag `enableMinCapacityWorkers` that enables minCapacity worker behavior. When enabled, workers fulfilling minCapacity requirements will run indefinitely (idle timeout set to 0), providing maximum cache preservation.
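For illustration, a pool definition using the proposed flag might look like the following; only `enableMinCapacityWorkers` comes from the text above, the surrounding fields are assumed:

```typescript
const pool = {
  minCapacity: 4,
  maxCapacity: 200,
  // Proposed flag: when true, the workers fulfilling minCapacity are launched
  // with an idle timeout of 0 and run indefinitely.
  enableMinCapacityWorkers: true,
};
```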
Contributor

I wonder if we really want an extra flag for this; for me, `minCapacity` is already there.
Unless someone wants to keep using pools the way they currently work, I think we can just not introduce a new flag.
Just thinking out loud.

Author

@JohanLorenzo JohanLorenzo Oct 7, 2025

I was thinking of a new flag in order to roll out the changes. But maybe there's another way I didn't think of. I'm open to other options 🙂

lotas
lotas previously approved these changes Oct 6, 2025
Contributor

@lotas lotas left a comment

I think this proposal is in really good shape from my point of view 👍 thank you

@JohanLorenzo
Author

@jcristau, I remember you had some thoughts on the current behavior. What do you think of this proposal?

Member

@petemoore petemoore left a comment

I understand the motivation for splitting a worker pool into two groups (the workers that stay alive forever, and those that don't), but I see some downsides to this approach. I think this solution is proposed because workers currently decide themselves when to terminate, rather than being signaled to do so by a central authority (e.g. Worker Manager). However, under this approach Worker Manager effectively manages the termination of some instances and not others, so the responsibility is somewhat artificially split, with neither party (Worker Manager or the worker itself) having control of the entire pool.

This leads to some unpleasant consequences, such as Worker Manager needing to hack the worker config generation: the worker config is currently embedded in the launch config and will not map to what the config actually looks like on the worker, since the idle timeout parameter only applies to a subset of the workers. Note that different worker implementations may also have different config settings for the idle timeout, so Worker Manager needing to understand the config of each worker implementation is also very brittle.

It also doesn't allow Worker Manager to be smart about which workers it wants to keep alive and which it wants to terminate, since it can only terminate the workers in the "stay alive forever" group, which might be suboptimal (perhaps it is preferable to retain a different subset of workers that are cheaper or more performant). For example, prices might drop for one instance type, but you have workers set to run forever with a different instance type, so you are stuck with the ones you have.

To solve all of this, since Worker Manager is now becoming responsible for managing the termination of some workers, I think it would be better to move the entire responsibility to Worker Manager and avoid hacking the Generic Worker config. I would propose that rather than quarantining the worker (which is intended for a different use case) and then later shutting it down, a signal of some sort would be sent telling the worker to gracefully terminate. This way we do not muddy the quarantine logs with events that are for terminations, not for quarantines (e.g. auditing real quarantines becomes problematic when the audit logs are full of business-as-usual terminations).

Consider also that technically a worker can perform work for a worker pool even if it wasn't launched by Worker Manager. Worker Manager may know it exists (since the Queue will report the worker as being active) and it may contribute towards the pool size, but maybe Worker Manager does not have a means to terminate it. My point here is that it gets very messy very quickly trying to account for all scenarios in this proposed design. I think it would be much simpler and cleaner instead for Worker Manager to make decisions about all the workers it launches, and move the responsibility to Worker Manager in a worker-implementation-agnostic way (Worker Manager should not need to know what the worker implementation config settings are or how to patch them) and at a worker pool level we should be able to tell Worker Manager "keep this many workers alive" and it should be able to signal to workers it has created that it no longer needs them, whenever it wants, and use this simple mechanism to manage its pool, not just a subset of the workers it has created.

So in summary, I think we should:

  1. When generating worker pool definitions for Generic Worker workers in community-tc-config, fxci-config, etc., set the idle timeout to 0 (the config option can continue to be supported by Generic Worker, just not used in our Worker Manager managed worker pools, as it is potentially still useful when the worker isn't spawned by Worker Manager).

  2. Introduce a signalling mechanism that Worker Manager can use to advise the worker to stop taking new jobs and to shut down (similar to quarantine, but more explicit that it should be followed by a shutdown). Avoid hijacking the existing call, which is really intended for auditing (rare) security incidents, troubleshooting unexpected behaviour, etc.

  3. Worker Manager implements some criteria to decide which workers to terminate and which to keep alive when pool size is greater than desired. Initially this could just be keeping around the workers that have been alive the longest, but there could be some scoring mechanism which favours long-lived workers but also measures performance, tracks anomalies or unexpected expired task claims, etc. (a rough sketch follows below). Maybe there is a cap on the maximum lifetime of a worker for security reasons (e.g. don't allow any worker to live longer than 2 weeks), which may also ensure images are not too old (e.g. if the machine image a worker is built from becomes out of date, the worker should not be kept indefinitely). There are several reasons why your initial choice about which workers to keep alive indefinitely might change, and having a homogeneous pool where "all workers are equal" in the ability to keep workers alive gives Worker Manager more power to make smarter decisions, avoids muddying configs, and keeps the split of responsibilities clear.
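A rough sketch of what such scoring could look like; the weights, signals, and two-week cap below are placeholder values for illustration, not decided policy:

```typescript
// Illustrative scoring: favour keeping long-lived, well-behaved, cheap workers,
// but never keep any worker beyond a hard maximum lifetime.
interface CandidateWorker {
  workerId: string;
  ageHours: number;
  costPerHour: number;
  expiredTaskClaims: number; // anomaly signal: task claims that expired unexpectedly
}

const MAX_LIFETIME_HOURS = 14 * 24; // e.g. a two-week security cap

function keepScore(w: CandidateWorker): number {
  if (w.ageHours > MAX_LIFETIME_HOURS) return Number.MIN_SAFE_INTEGER; // must be terminated
  return w.ageHours - 10 * w.expiredTaskClaims - 5 * w.costPerHour;
}

// When capacity exceeds the target, terminate the lowest-scoring workers first.
function selectForTermination(workers: CandidateWorker[], excess: number): CandidateWorker[] {
  return [...workers].sort((a, b) => keepScore(a) - keepScore(b)).slice(0, Math.max(0, excess));
}
```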

@lotas
Contributor

lotas commented Nov 3, 2025

If I understand correctly @petemoore, you suggest:

  1. Add a feature to g-w (worker-runner) to ask worker-manager if it should keep on doing work or should terminate itself (as soon as appropriate)
  2. let worker-manager scale down automatic pools.

Aren't they competing solutions?

I like the idea of making worker manager not aware of worker configs, probably less error-prone in the long run 👍

I think there could be edge cases on both sides of those solutions. For example, if we make all workers long-lived and worker-manager fails in some way to provision/scan (like expired credentials or an internal error), many workers could end up running way longer than they should.

Johan's initial proposal is probably a good temporary fix to let some workers live longer, but implementing some sort of signaling mechanism where workers talk to worker manager to know if they should keep going instead of shutting down might be better in the long run.

@petemoore
Member

petemoore commented Nov 3, 2025

I wrote a lot of words, but in summary, I think I’m suggesting the approach is ok in principle, but should be for the whole pool, instead of splitting it in two.

This way, Worker Manager is free to decide which instances to terminate, based on its own conditions, which may change over time.

With Worker Manager managing the termination of workers, we can either turn off the feature on the workers directly in the Generic-Worker config, or have some longer fallback period, to catch situations like the one you highlight.

Regarding how Worker Manager communicates with the worker to ask it to gracefully shut down, I am reasonably agnostic. Perhaps the quarantine followed by a termination isn't the worst thing, although it doesn't really allow workers to perform any cleanup (imagine that in future workers might want to preserve their caches by writing them to a storage bucket before terminating, securely erase secrets, write an activity report somewhere, report lifecycle events to some audit trail, or do any other teardown-type activity that might be useful...).

In any case, I think any time Worker Manager decides it no longer requires an instance, there should be some way to signal it to a worker, so that it can gracefully terminate, but Worker Manager should keep track of the instance and forcefully kill it after some timeout, if the worker hasn't successfully terminated itself.
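A sketch of that graceful-then-forceful flow; `signalGracefulShutdown`, `hasTerminated`, and `forceTerminate` are hypothetical helpers standing in for whatever mechanism gets chosen, and the grace period is an example value:

```typescript
// Hypothetical helpers: the signalling call, a liveness check, and the
// provider-level hard kill. None of these are existing worker-manager APIs.
declare function signalGracefulShutdown(workerId: string): Promise<void>;
declare function hasTerminated(workerId: string): Promise<boolean>;
declare function forceTerminate(workerId: string): Promise<void>;

const GRACE_PERIOD_MS = 30 * 60 * 1000; // example grace period, not a decided value

async function retireWorker(workerId: string): Promise<void> {
  await signalGracefulShutdown(workerId); // "stop claiming tasks, then shut down"
  const deadline = Date.now() + GRACE_PERIOD_MS;
  while (Date.now() < deadline) {
    if (await hasTerminated(workerId)) return; // worker shut itself down in time
    await new Promise((resolve) => setTimeout(resolve, 60_000)); // poll every minute
  }
  await forceTerminate(workerId); // fallback: hard kill at the cloud provider
}
```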

@lotas
Contributor

lotas commented Nov 3, 2025

Worker-manager can already `removeWorker()`, which will tell the instance to shut down, and worker-runner/generic-worker would get notified when this happens. But I guess it comes without any warning to the instance that is about to shut down, unlike spot termination, where you can potentially upload stuff before shutting down.

If there were a similar way to send a SIGTERM signal to a VM and let g-w wrap things up before the hard-kill event, that would probably make it easier.
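To illustrate the kind of wrap-up being described, here is a Node-style sketch of a worker-side handler; generic-worker itself is written in Go, and `finishCurrentTask` / `uploadCaches` are hypothetical hooks:

```typescript
import process from 'node:process';

// Hypothetical worker-side hooks standing in for whatever cleanup the worker wants to do.
declare function finishCurrentTask(): Promise<void>;
declare function uploadCaches(): Promise<void>;

let acceptingTasks = true; // consulted by the task-claim loop (not shown)

// On a shutdown warning (SIGTERM here, or a spot-termination-style notice):
// stop claiming new work, wrap up, then exit before the hard kill arrives.
process.on('SIGTERM', async () => {
  acceptingTasks = false;
  await finishCurrentTask();
  await uploadCaches();
  process.exit(0);
});
```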
